Term Translations in Parallel Corpora: Discovery and Consistency Check

Author

  • Dan Tufis
Abstract

The paper describes a method for identifying term translations in parallel corpora, developed within the FF-POIROT European project. The project aims to build multilingual (Dutch, Italian, French and English) resources in the financial/legal domain that investigative bodies and law enforcement can use in knowledge and information systems to detect, investigate or help prevent instances of actual or attempted financial fraud. The methodology builds on our word alignment procedure, which is based on translation equivalents extracted from parallel corpora. When a validated list of multiword terms is available in one language, the procedure provides their translations in any of the other languages present in the parallel corpus. Because a term is usually semantically unambiguous, the translations found for different occurrences of the same term should be the same (modulo inflectional variations). If they are not, one might suspect a non-systematic translation of the original term. When a manually compiled term list is not available, the system tries to discover term candidates by extracting sequences of words that appear together more frequently than expected by chance. Using the same alignment procedure, the occurrences of the candidate terms in one language are then linked to their translation equivalents in the other languages.
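The two ideas in the abstract, discovering term candidates from word sequences that co-occur more often than chance and checking that a term is translated consistently, can be illustrated with a minimal sketch. The abstract does not say which association measure or normalization the system uses, so everything here is an assumption made for illustration: pointwise mutual information over adjacent word pairs stands in for the "more frequently than expected by chance" test, lowercasing stands in for handling inflectional variation, and the function names `candidate_terms` and `inconsistent_terms` are hypothetical. The aligned (term, translation) pairs are assumed to come from a word aligner that is not shown.

```python
import math
from collections import Counter, defaultdict


def candidate_terms(sentences, min_count=2, min_pmi=1.0):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI is an assumed stand-in for the paper's association test.
    `sentences` is a list of token lists.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scored = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log2(p_pair / p_independent)
        if pmi >= min_pmi:
            scored[(w1, w2)] = pmi
    return scored


def inconsistent_terms(aligned_occurrences, normalize=str.lower):
    """Flag source terms whose occurrences map to more than one translation.

    `aligned_occurrences` is an iterable of (term, translation) pairs.
    `normalize` is a placeholder for lemmatization; lowercasing alone does
    not remove real inflectional variation.
    """
    translations = defaultdict(Counter)
    for term, translation in aligned_occurrences:
        translations[term][normalize(translation)] += 1
    return {
        term: dict(counts)
        for term, counts in translations.items()
        if len(counts) > 1
    }
```

A term flagged by `inconsistent_terms` is only a candidate for review: with a real lemmatizer in place of `normalize`, the remaining variants are the ones that may indicate a non-systematic translation.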


Similar resources

Discovering Light Verb Constructions and their Translations from Parallel Corpora without Word Alignment

We propose a method for joint unsupervised discovery of multiword expressions (MWEs) and their translations from parallel corpora. First, we apply independent monolingual MWE extraction in source and target languages simultaneously. Then, we calculate translation probability, association score and distributional similarity of co-occurring pairs. Finally, we rank all translations of a given MWE ...


Machine Translation of Bio-Thesauri

In this paper we describe how we applied a general-purpose machine translation tool for translating biomedical thesauri. We used corresponding terms in parallel corpora to check the validity of the translations. The advantage of this approach is that a single corresponding set of terms can be verified where techniques to retrieve translations from a parallel corpus do not exploit the knowledge c...


A Methodology for Bilingual Lexicon Extraction from Comparable Corpora

Dictionary extraction using parallel corpora is well established. However, for many language pairs parallel corpora are a scarce resource which is why in the current work we discuss methods for dictionary extraction from comparable corpora. Hereby the aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in sev...


Using Parallel Corpora to enrich Multilingual Lexical Resources

This paper describes the use of a bilingual vector model for the automatic discovery of German translations of English terms. The model is built by analysing co-occurrence patterns in a parallel corpus of English and German medical abstracts, a method also used for Cross-Lingual Information Retrieval. The model generates candidate German translations of English words using the cosine similarity m...


Learning to Find Translations and Transliterations on the Web based on Conditional Random Fields

In recent years, state-of-the-art cross-linguistic systems have been based on parallel corpora. Nevertheless, it is difficult at times to find translations of a certain technical term or named entity even with a very large parallel corpus. In this paper, we present a new method for learning to find translations on the Web for a given term. In our approach, we use a small set of terms and trans...



Publication date: 2004